# From the menu at the top of RStudio, select:
# "Session -> Set Working Directory -> To Source File Location"
# Then run this chunk to load the data into R.
library(mosaic)
library(car)
library(DT)
library(pander)
library(readr)
library(tidyverse)
library(plotly)
food <- read_csv("food.csv") #food.csv is in the Data folder...


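# Recode exercise frequency, convert GPA to numeric (non-numeric entries become NA),
# then drop incomplete rows.
food1 <- food |> 
  mutate(exercise = case_when(
    exercise == 1 ~ "Everyday", 
    exercise == 2 ~ "2~3 Times a Week",
    exercise == 3 ~ "Once a Week")) |> 
  mutate(GPA = as.numeric(GPA), exercise = as.factor(exercise)) |> 
  na.omit() # na.omit() ignores extra arguments: rows with an NA in ANY column are removed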
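
The cleaning step can be checked on a small toy data frame. This is a minimal sketch in base R (to stay dependency-free); the data frame `toy` and its `weight` column are hypothetical and are not taken from food.csv. It shows that `na.omit()` drops a row when any column is missing, whereas filtering on `GPA` and `exercise` alone keeps more rows.

```r
# Hypothetical toy data, NOT taken from food.csv
toy <- data.frame(
  GPA      = c("3.5", "Unknown", "3.9", "2.8"),
  exercise = c(1, 2, NA, 3),
  weight   = c(150, 160, 170, NA)
)

# Non-numeric GPA entries become NA (the coercion warning is suppressed here)
toy$GPA <- suppressWarnings(as.numeric(toy$GPA))
toy$exercise <- factor(toy$exercise, levels = 1:3,
                       labels = c("Everyday", "2~3 Times a Week", "Once a Week"))

# Drops every row that has an NA in ANY column
cleaned_all <- na.omit(toy)

# Drops only rows that are missing GPA or exercise
cleaned_two <- toy[complete.cases(toy[, c("GPA", "exercise")]), ]

nrow(cleaned_all)
nrow(cleaned_two)
```

If the two row counts differ on the real data, the group sizes used in the analysis depend on missing values in columns that the analysis never uses.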

Intro/Context

The “Food” data set contains information about health habits, GPA, and basic demographics. This analysis uses only the GPA that each person reported and how often they exercise in a week. There are three exercise groups: (1) Everyday, (2) Two or Three Times a Week (labeled 2~3 Times a Week), and (3) Once a Week. I will test whether GPA differs by how often people exercise in a week, regardless of their gender, height, and other information that is not used in this analysis. The level of significance for this test will be α = 0.05.

The null hypothesis claims that all three groups have the same average GPA, represented by μ. The alternative hypothesis claims that at least one of the groups has a different average GPA.

Ho/Ha (ANOVA)

\[ H_0: \mu_{\text{Everyday}} = \mu_{\text{Two or Three Times a Week}} = \mu_{\text{Once a Week}} = \mu \]

\[ H_a: \mu_i \neq \mu \ \text{for at least one group } i \]

\[ \alpha = 0.05 \]

Graphical summary

ggplotly(
  ggplot(food1, aes(x = exercise, y = GPA)) +
    geom_boxplot(fill = c("orange", "seagreen", "royalblue"), color = "black") +
    labs(title = "Distribution of GPA by Group", x = "Exercise Activity", y = "GPA")
)

This graph is interesting because it is commonly believed that daily exercise helps students perform better at school, yet the median GPA for people who exercise 2 or 3 times a week is slightly higher than for people who exercise every day. In fact, the median GPA for people who exercise every day is closer to the median for people who exercise only once a week. It might be that they spend more time working out than studying, or that they put more effort into working out than into school. An ANOVA test will show whether the average GPA is the same for all three groups or whether at least one group has a significantly different average.

Numerical summary

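favstats(GPA ~ exercise, data = food1) |> 
  pander()

| exercise | min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|---|
| 2~3 Times a Week | 3 | 3.3 | 3.65 | 3.8 | 4 | 3.565 | 0.3156 | 16 | 0 |
| Everyday | 2.8 | 3.2 | 3.5 | 3.7 | 3.904 | 3.446 | 0.3313 | 31 | 0 |
| Once a Week | 2.6 | 3.4 | 3.4 | 3.4 | 3.7 | 3.3 | 0.4123 | 5 | 0 |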

The group with the highest reported GPA was the 2~3 Times a Week group, with a maximum of 4.0, while the Everyday group had a maximum of 3.904. Does that mean that people who exercise every day are giving up 0.096 GPA points in exchange for working out daily? Not necessarily. An ANOVA test takes all of the data into account, so it can show whether the average differs for at least one group or whether all groups perform equally on average, rather than comparing only the maximum scores reported.

ANOVA test

food.aov <- aov(GPA ~ exercise, data = food1)
pander(summary(food.aov))

Analysis of Variance Model

| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| exercise | 2 | 0.3066 | 0.1533 | 1.374 | 0.2627 |
| Residuals | 49 | 5.466 | 0.1116 | | |

The p-value for this test is 0.2627, which is greater than the significance level of α = 0.05, so the ANOVA does not provide sufficient evidence to reject the null hypothesis.
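
The F statistic and p-value in the ANOVA table can be reproduced from the group sizes, means, and standard deviations reported in the numerical summary. The sketch below is an illustration rather than part of the original analysis; it uses the rounded values from that summary, so the results should match the table only up to rounding. The variable names are chosen here for illustration.

```r
# Group summaries copied from the favstats() output (rounded values)
n <- c(16, 31, 5)                # group sizes
m <- c(3.565, 3.446, 3.300)      # group means
s <- c(0.3156, 0.3313, 0.4123)   # group standard deviations

k <- length(n)                   # number of groups
N <- sum(n)                      # total sample size

# Between-group and within-group sums of squares
grand_mean <- sum(n * m) / N
ss_between <- sum(n * (m - grand_mean)^2)
ss_within  <- sum((n - 1) * s^2)

# F = mean square between / mean square within
f_stat  <- (ss_between / (k - 1)) / (ss_within / (N - k))
p_value <- pf(f_stat, df1 = k - 1, df2 = N - k, lower.tail = FALSE)

round(c(SS_between = ss_between, SS_within = ss_within,
        F = f_stat, p = p_value), 4)
```

The sums of squares should come out close to 0.3066 and 5.466, the values in the Sum Sq column, which confirms that the table and the numerical summary describe the same 52 observations.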

Check the requirements for the ANOVA tests

par(mfrow=c(1,2))
plot(food.aov, which= 1:2)

These two graphs show the Residuals vs Fitted plot on the left and the Normal Q-Q plot on the right. Both plots show some outliers, and because of these departures from the ANOVA requirements, a non-parametric test like Kruskal-Wallis would be more appropriate for this analysis.
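
The constant-variance requirement can also be checked informally with numbers. The sketch below applies a common rule of thumb (an assumption here, not a method used in the original analysis): the spread is treated as similar enough when the largest group standard deviation is less than about twice the smallest. It addresses only the equal-variance requirement and says nothing about the outliers seen in the plots.

```r
# Standard deviations copied from the favstats() output
group_sd <- c("2~3 Times a Week" = 0.3156,
              "Everyday"         = 0.3313,
              "Once a Week"      = 0.4123)

# Rule of thumb (assumption): largest sd / smallest sd should be below 2
sd_ratio <- max(group_sd) / min(group_sd)
round(sd_ratio, 3)
sd_ratio < 2

# With the real data loaded, formal tests could be run as well, for example:
# shapiro.test(residuals(food.aov))             # normality of residuals
# car::leveneTest(GPA ~ exercise, data = food1) # equality of variances
```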

Kruskal Wallis - compare results with ANOVA

kruskal.test(GPA ~ exercise, data = food1) |> 
  pander()

Kruskal-Wallis rank sum test: GPA by exercise

| Test statistic | df | P value |
|---|---|---|
| 2.513 | 2 | 0.2847 |

The ANOVA test gave a p-value of 0.2627, and the Kruskal-Wallis test gave a p-value of 0.2847, which is 0.022 higher. Both tests have 2 degrees of freedom for the exercise factor. Even though the p-value was higher for the Kruskal-Wallis test, both p-values are greater than the significance level of 0.05, which means that both tests fail to reject the null hypothesis.
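
As a quick check on the Kruskal-Wallis output, the p-value can be recovered by hand from the test statistic. This sketch assumes the usual chi-squared approximation with 2 degrees of freedom, for which the upper-tail probability has a simple closed form; the symbol H for the test statistic is notation introduced here.

\[ P(\chi^2_2 > H) = e^{-H/2} = e^{-2.513/2} \approx 0.285 \]

This agrees with the reported p-value of 0.2847 up to rounding of the test statistic.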

Conclusion/Future Studies

The ANOVA test had a p-value of 0.2627 > α = 0.05, so we fail to reject the null hypothesis. There is insufficient evidence to conclude that the average GPA differs among the three exercise groups.

food1 |> 
  group_by(exercise) |> 
  summarise(GPA_Average = mean(GPA)) |> 
  pander()

| exercise | GPA_Average |
|---|---|
| 2~3 Times a Week | 3.565 |
| Everyday | 3.446 |
| Once a Week | 3.3 |

For future studies, the sample size of each group should be greater than 30 (here the groups had 16, 31, and 5 people). That would make it easier to meet the requirements of the ANOVA test and would provide a larger, more consistent data set for examining these hypotheses in more depth.
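
One way to put a number on the sample size for a future study is a power calculation with `power.anova.test()` from base R. The sketch below rests on several assumptions that are not part of the original analysis: a balanced design, a target power of 80%, and true group means and within-group variance equal to the estimates observed here (which are themselves uncertain, especially for the group of 5).

```r
# Observed group means and residual mean square from this analysis,
# treated as if they were the true values (a strong assumption)
group_means <- c(3.565, 3.446, 3.300)
within_var  <- 0.1116

plan <- power.anova.test(
  groups      = 3,
  between.var = var(group_means),  # variance of the group means
  within.var  = within_var,
  sig.level   = 0.05,
  power       = 0.80               # assumed target power
)

# Required number of people per group, rounded up
n_per_group <- ceiling(plan$n)
n_per_group
```

The result can be compared with the guideline of more than 30 people per group. If the true differences between groups are smaller than the ones observed here, the required sample size would be larger.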